Goals for today:
Why start with binary GLMs? Because these are the simplest of all GLMs. Not too much can go wrong with them. In future sessions we will work with other types of GLMs for which we will need to learn some more concepts and be more careful with assumptions.
Today we will need the following packages:
library(tidyverse) #including ggplot2
library(ggbeeswarm)
library(ggResidpanel)
library(glmmTMB)
library(DHARMa)
library(emmeans)
In addition, if you want to draw the diagrams representing our models, you can install:
library(DiagrammeR)
We can draw a simple linear model like:
grViz("
digraph boxes_and_circles {
node [shape = square, color = DeepSkyBlue]
response; predictor
node [shape = diamond, color=ForestGreen]
intercept; effect
node [shape = circle, color = black]
'expected value';
'expected value' -> response [ label = ' + rnorm(0,\U03C3)', style=dashed, fontcolor=ForestGreen];
intercept -> 'expected value';
predictor -> effect [arrowhead = none, label = ' \U00D7'];
effect -> 'expected value' ;
}")grViz("
digraph boxes_and_circles {
node [shape = square, color = DeepSkyBlue]
response; predictor
node [shape = diamond, color=ForestGreen]
'rnorm() noise'; effect
node [shape = circle, color = black]
'expected value';
'expected value' -> response;
predictor -> effect [arrowhead = none, label = ' \U00D7'];
effect -> 'expected value' ;
'rnorm() noise' -> response
}")Or if you prefer, we can write an equation:
\[ response = intercept + effect \times predictor + residual \text{, with } residual \sim Normal(0, \sigma) \]
First, let us look at an example where a linear model performs fine. We simulate data following all the assumptions of a linear model with the equation above.
First, we choose parameter values for the equation:
number_of_observations <- 30
intercept <- 1
effect <- 0.5
sigma_residual <- 1
Then we draw random numbers for the predictor (for which we don’t have an equation, so we could do whatever) and for the response according to the equation:
set.seed(934)
#
# predictor <- rnorm(n = number_of_observations, mean = 0, sd = 1)
#
# response <- intercept + effect*predictor + rnorm(n = number_of_observations,mean = 0, sd = sigma_residual)
data_linear <- tibble(predictor = rnorm(n = number_of_observations, mean = 0, sd = 1),
response = intercept + effect*predictor + rnorm(n = number_of_observations, mean = 0, sd = sigma_residual))
Let us visualize these data:
(p <- ggplot(data_linear, aes(x=predictor, y=response))+
geom_point())
We fit a simple linear model and visualise the model prediction:
lm0 <- lm(response~predictor, data = data_linear)
summary(lm0)
##
## Call:
## lm(formula = response ~ predictor, data = data_linear)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.3702 -0.5418 -0.1358 0.5766 2.1599
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.7987 0.1586 5.037 2.51e-05 ***
## predictor 0.5990 0.1706 3.511 0.00153 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.7867 on 28 degrees of freedom
## Multiple R-squared: 0.3057, Adjusted R-squared: 0.2809
## F-statistic: 12.33 on 1 and 28 DF, p-value: 0.001533
(p <- p + geom_smooth(method="lm"))
## `geom_smooth()` using formula 'y ~ x'
Next we draw the residuals:
p + geom_segment(aes(x=predictor, y=response, xend= predictor, yend=lm0$fitted.values))
## `geom_smooth()` using formula 'y ~ x'
Residuals look mostly unstructured, random and normally distributed. We can better assess these properties using diagnostic plots showing different properties of the residuals:
resid_panel(lm0)
The diagnostic plots may not look perfect, but there is actually no serious violation of assumptions here. This is an acceptable linear model.
Now let’s generate data with the same structure, except we convert the continuous values to binary values, following a logistic transformation and a Bernoulli distribution. For this type of data we generally want to model the probability of a 1 (survival, presence, success…). Data can be either 0 or 1. Although we cannot observe values between 0 and 1 (e.g., 0.5) in the data, we can interpret such a value: it is a probability (of observing a 1). On the other hand, values below 0 or above 1 are meaningless for a binary variable.
grViz("
digraph boxes_and_circles {
node [shape = square, color = DeepSkyBlue]
response; predictor
node [shape = diamond, color=ForestGreen]
intercept; effect
node [shape = circle, color = black]
'expected value';
'expected value' -> probability [ label = ' inverse logit'];
probability -> response [ label = ' rbinom(size=1, p=probability)', style=dashed];
intercept -> 'expected value';
predictor -> effect [arrowhead = none, label = ' \U00D7'];
effect -> 'expected value' ;
}")Or if you prefer, we can write an equation:
\[ response \sim \mathrm{Bernoulli} (\mathrm{logit}^{-1} (intercept + effect \times predictor) ) \]
set.seed(935)
number_observations <- 30
intercept <- 1
effect <- 2
predictor <- rnorm(n = number_observations,mean = 0,sd = 1)
latent <- intercept + effect*predictor
probability <- plogis(latent)
response <- sapply(probability, FUN=function(x){rbinom(1,1,x)})
data_binary <- tibble(predictor=predictor, response=response)
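As an aside, the sapply() loop above is not strictly necessary: rbinom() recycles its prob argument across draws, so a single vectorised call produces the same kind of data (the individual draws will not match the seeded ones above exactly). A sketch:

```r
set.seed(935)
number_observations <- 30
predictor <- rnorm(n = number_observations, mean = 0, sd = 1)
probability <- plogis(1 + 2 * predictor)
# rbinom() accepts a vector of probabilities, one per draw:
response <- rbinom(n = number_observations, size = 1, prob = probability)
```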
ggplot(data_binary, aes(x=predictor, y=response))+ geom_beeswarm(groupOnX = FALSE)
Let’s fit a linear regression to these data and visualise the result.
lm1 <- lm(response~predictor, data = data_binary)
summary(lm1)
##
## Call:
## lm(formula = response ~ predictor, data = data_binary)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.67756 -0.42221 0.04016 0.36265 0.62673
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.62268 0.07761 8.024 9.75e-09 ***
## predictor 0.28967 0.08316 3.483 0.00165 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4236 on 28 degrees of freedom
## Multiple R-squared: 0.3023, Adjusted R-squared: 0.2774
## F-statistic: 12.13 on 1 and 28 DF, p-value: 0.001647
ggplot(data_binary, aes(x=predictor, y=response))+
geom_smooth(method="lm", fullrange=TRUE) + geom_point(alpha=0.5) +
geom_segment(aes(x=predictor, y=response, xend=predictor, yend=lm1$fitted.values), alpha=0.4) +
xlim(c(-3,1.2))
## `geom_smooth()` using formula 'y ~ x'
## Warning: Removed 3 rows containing non-finite values (stat_smooth).
## Warning: Removed 3 rows containing missing values (geom_point).
## Warning: Removed 3 rows containing missing values (geom_segment).
Several problems are apparent:

- The fitted line (and its confidence band) predicts values below 0 and above 1, which are meaningless for a binary response.
- The relationship is forced to be linear, whereas a probability must flatten out as it approaches 0 and 1.
- The residuals cannot be normally distributed: for a given predictor value they can take only two values.
In addition, if we look at the diagnostic plots:
resid_panel(lm1)
That is not a good linear model. Inference from such a model is unreliable. In particular, you should not trust any measure of uncertainty (standard error, confidence interval, p-value), and you should certainly not trust extrapolation.
Instead of a linear model, here we need to use a Generalized Linear Model (GLM). GLMs come in different flavours or families, each adapted to different types of response variables. Here we will talk about GLMs for binary data, which belong to the binomial family.
Before discussing what a GLM is in detail, let’s see that it is very easy to fit one in R. We can use the function glm(), which works exactly like lm(), except that you need to specify a family:
glm1 <- glm(response~predictor, data = data_binary, family = "binomial")
summary(glm1)
##
## Call:
## glm(formula = response ~ predictor, family = "binomial", data = data_binary)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.6327 -1.0134 0.3569 0.8682 1.5313
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.6969 0.4720 1.476 0.1398
## predictor 1.7407 0.7132 2.440 0.0147 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 40.381 on 29 degrees of freedom
## Residual deviance: 29.754 on 28 degrees of freedom
## AIC: 33.754
##
## Number of Fisher Scoring iterations: 5
The summary output looks very similar to that of lm(). From a quick look we see that the predictor has a positive effect on the response, as expected.
ANOVAs are difficult with GLMs, but we can still have a look to confirm that there is good evidence that the predictor has an effect on the response. We just need to specify test = "Chisq".
anova(glm1, test = "Chisq")
## Analysis of Deviance Table
##
## Model: binomial, link: logit
##
## Response: response
##
## Terms added sequentially (first to last)
##
##
## Df Deviance Resid. Df Resid. Dev Pr(>Chi)
## NULL 29 40.381
## predictor 1 10.627 28 29.754 0.001114 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Let’s see what the model prediction looks like:
(p <- ggplot(data_binary, aes(x=predictor, y=response))+
geom_smooth(method="glm", method.args = list(family="binomial"), fullrange=TRUE) +
geom_point())
## `geom_smooth()` using formula 'y ~ x'
That looks good!
What happens to the prediction when we extrapolate beyond the range of \(x\) values?
p + xlim(-10,10)
## `geom_smooth()` using formula 'y ~ x'
Two very good points:

- Predictions never go below 0 or above 1, even far outside the range of the data.
- The relationship is no longer forced to be linear: it flattens out as the probability approaches 0 and 1.
Two of the three problems we saw with lm() are fixed. What about the third one, the distribution of residuals?
p + geom_segment(aes(x=predictor, y=response, xend=predictor, yend=fitted(glm1)))
## `geom_smooth()` using formula 'y ~ x'
resid_panel(glm1)
The residuals do not look much better with the glm() than with the lm(). But that is okay! GLMs do not have the same assumptions as LMs. In fact, binary binomial GLMs have very few assumptions. The only one you need to remember for now is that observations are assumed to be independent of each other conditional on the predictors; in other words, your model should not leave important blocking factors unaccounted for.
When working with GLMs, the function resid_panel() is useless.
What to do then? The basic idea is to simulate fake data from the model we fitted and compare them to real data. If the model is a good fit to the data, then the simulated data should look like the real data.
For instance, let’s visualise the mean and standard deviation of simulated versus real data
real_properties <- data_binary %>%
summarise(mean=mean(response), sd=sd(response)) %>%
pivot_longer(cols = c(mean, sd), names_to = "property")
simulated_properties_glm <- map_dfr(.x = 1:1000, ~{simulate(glm1)}, .id = "simulation" ) %>%
group_by(simulation) %>%
summarise(mean=mean(sim_1), sd=sd(sim_1)) %>%
pivot_longer(cols = c(mean, sd), names_to = "property")
simulated_properties_glm %>% ggplot(aes(y=value, x=property)) +
geom_violin() +
geom_point(data=real_properties, col="red")It looks like our model generates the most likely values of both mean and variance. That is good, but that is only 2 properties out of 30 data points, and we needed quite a lot of complex code to get there!
In addition, the linear model seems to perform very well too:
simulated_properties_linearmodel <- map_dfr(.x = 1:1000, ~{simulate(lm1)}, .id = "simulation" ) %>%
group_by(simulation) %>%
summarise(mean=mean(sim_1), sd=sd(sim_1)) %>%
pivot_longer(cols = c(mean, sd), names_to = "property")
simulated_properties_linearmodel %>% ggplot(aes(y=value, x=property)) +
geom_violin() +
geom_point(data=real_properties, col="red")Fortunately, there is package that automates and make the assessment mathematically more correct and powerful, by focusing on various aspects of the distribution of real and simulated residuals (transformed with some clever math), so that we can assess all aspects of the model fit (not just the mean and variance).
library(DHARMa)
We will generally use this package in two ways: plotting the simulated residuals with plot(simulateResiduals()), and testing for over- or under-dispersion with testDispersion().
plot( simulateResiduals(glm1) )
testDispersion(glm1)
##
## DHARMa nonparametric dispersion test via sd of residuals fitted vs.
## simulated
##
## data: simulationOutput
## dispersion = 1.135, p-value = 0.448
## alternative hypothesis: two.sided
Let’s see if the linear model performs well (NB: we already know lm1 is not great because of the output of resid_panel()):
plot( simulateResiduals(lm1) ) # pretty bad distribution of residuals
## Warning in newton(lsp = lsp, X = G$X, y = G$y, Eb = G$Eb, UrS = G$UrS, L =
## G$L, : Fitting terminated with step failure - check results carefully
## qu = 0.5, log(sigma) = -2.380996 : outer Newton did not converge fully.
## qu = 0.5, log(sigma) = -2.303154 : outer Newton did not converge fully.
testDispersion(lm1) #no problem here
##
## DHARMa nonparametric dispersion test via sd of residuals fitted vs.
## simulated
##
## data: simulationOutput
## dispersion = 0.97814, p-value = 0.944
## alternative hypothesis: two.sided
Here the linear model is clearly not a good fit. A note of warning though: do not trust those tests and plots blindly! If you have good conceptual reasons to think a linear model is wrong for a dataset, don’t let non-significant p-values change your mind.
We will practice using DHARMa more in future sessions. For now, know that it is a good option for testing GLM assumptions.
Reminder: we simulated those data, using a generative model corresponding to a binomial GLM:
set.seed(935)
number_observations <- 30
intercept <- 1
effect <- 2
predictor <- rnorm(n = number_observations,mean = 0,sd = 1)
latent <- intercept + effect*predictor
probability <- plogis(latent)
response <- sapply(probability, FUN=function(x){rbinom(1,1,x)})
data_binary <- tibble(predictor=predictor, response=response)
ggplot(data_binary, aes(x=predictor, y=response))+ geom_beeswarm(groupOnX = FALSE)
A linear model (in red) shows many pathologies, whereas a binomial GLM (in blue) does what we want:
ggplot(data_binary, aes(x=predictor, y=response))+ geom_beeswarm(groupOnX = FALSE) +
geom_smooth(method="lm", col="red", fill="red", fullrange=TRUE) +
geom_smooth(method="glm", col="blue", fill= "blue", method.args = list(family="binomial"), fullrange=TRUE) +
xlim(c(-4, 4))
## `geom_smooth()` using formula 'y ~ x'
## `geom_smooth()` using formula 'y ~ x'
The data set “voles_early.csv” contains records from the beginning of a survey of a wild rodent population. Among the variables is “survival”, a binary variable indicating whether an individual captured in a given year survived to the next year. We want to understand variation in survival, and in particular whether there is natural selection on body mass through survival. Variables other than body mass probably structure the variation in survival, making it less clear what model we should use. Therefore we start building our models by checking whether sex, age and other variables matter for survival.
(Photo: a cute snow vole)
survdat_early <- read_csv("Data/voles_early.csv")
## Rows: 809 Columns: 10
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (4): id, sex, age, mother
## dbl (6): year, mass, survival, reproduction, fitness, StMass
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
summary(survdat_early)
## id year sex age
## Length:809 Min. :2006 Length:809 Length:809
## Class :character 1st Qu.:2007 Class :character Class :character
## Mode :character Median :2008 Mode :character Mode :character
## Mean :2008
## 3rd Qu.:2009
## Max. :2010
##
## mass survival reproduction fitness
## Min. :11.00 Min. :0.0000 Min. : 0.000 Min. : 0.000
## 1st Qu.:22.00 1st Qu.:0.0000 1st Qu.: 0.000 1st Qu.: 0.000
## Median :28.00 Median :0.0000 Median : 0.000 Median : 0.000
## Mean :30.46 Mean :0.2213 Mean : 1.135 Mean : 1.577
## 3rd Qu.:40.00 3rd Qu.:0.0000 3rd Qu.: 1.000 3rd Qu.: 2.000
## Max. :62.00 Max. :1.0000 Max. :15.000 Max. :15.000
## NA's :13
## StMass mother
## Min. :-1.92449 Length:809
## 1st Qu.:-0.86466 Class :character
## Median :-0.28657 Mode :character
## Mean :-0.04994
## 3rd Qu.: 0.86960
## Max. : 2.98925
## NA's :13
survdat_early %>% group_by(sex) %>%
summarize(p_survival=mean(survival), count_survival=sum(survival), count_death=sum(1-survival))
## # A tibble: 2 × 4
## sex p_survival count_survival count_death
## <chr> <dbl> <dbl> <dbl>
## 1 Female 0.278 117 304
## 2 Male 0.160 62 326
vole_s_glm <- glm(survival ~ sex, data = survdat_early, family = "binomial")
summary(vole_s_glm)
##
## Call:
## glm(formula = survival ~ sex, family = "binomial", data = survdat_early)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -0.8070 -0.8070 -0.5901 -0.5901 1.9151
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.9549 0.1088 -8.777 < 2e-16 ***
## sexMale -0.7049 0.1762 -4.001 6.29e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 855.11 on 808 degrees of freedom
## Residual deviance: 838.51 on 807 degrees of freedom
## AIC: 842.51
##
## Number of Fisher Scoring iterations: 4
How to interpret the summary output? What are these parameters and how do they relate to the probability of year-to-year survival in females and males?
coef(vole_s_glm)
## (Intercept) sexMale
## -0.9548538 -0.7049092
survdat_early %>% group_by(sex) %>% summarize(mean=mean(survival))
## # A tibble: 2 × 2
## sex mean
## <chr> <dbl>
## 1 Female 0.278
## 2 Male 0.160
Parameter estimates do predict mean sex-specific survival, but this is not obvious, because the GLM is fitted not on the probability scale but on a transformed linear scale (on which effects are additive, like in a linear model).
All GLMs have such a transformation. It is called the “link function”.
To go from a probability to the linear scale, our GLM applies the logit link function:
\[ \mathrm{logit}(p) = \log(\frac{p}{1-p}) \]
To go from model predictions on the linear scale back to the probability scale, we apply the inverse link function: \[
\mathrm{logit}^{-1}(y) = \frac{1}{1+e^{-y}}
\] In R, this inverse logit function is plogis().
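The logit itself is available in R as qlogis(); a quick check that the two functions undo each other:

```r
p <- 0.278
y <- qlogis(p)                           # logit: log(p / (1 - p))
plogis(y)                                # inverse logit recovers p
all.equal(plogis(y), 1 / (1 + exp(-y)))  # matches the formula above
```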
So, our model told us that the predicted survival for females (the intercept) was:
1/(1+exp(-coef(vole_s_glm)[1]))
## (Intercept)
## 0.2779097
plogis(coef(vole_s_glm)[1])
## (Intercept)
## 0.2779097
And the predicted survival for males was:
plogis(coef(vole_s_glm)[1] + coef(vole_s_glm)[2] )
## (Intercept)
## 0.1597938
In a binomial GLM, when you want to calculate something on the probability scale (which makes more sense to the human brain) from parameter estimates, you:

- first sum all the relevant parameter estimates on the linear (link) scale,
- then apply the inverse link function (plogis()) to that sum.

It does not work if you apply the inverse transformation on different parameters separately.
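A quick numeric illustration, using round numbers close to the estimates above:

```r
intercept <- -0.95   # roughly the female (intercept) estimate
slope     <- -0.70   # roughly the sexMale estimate
# Correct: sum on the link scale, then back-transform the sum
plogis(intercept + slope)          # about 0.16, the male survival probability
# Wrong: back-transforming each parameter separately gives a very different number
plogis(intercept) + plogis(slope)  # about 0.61, which is meaningless here
```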
emmeans()
It is a good idea to know how to back-transform link functions by hand: it forces you to understand how GLMs work, it is useful for simulating data, and sometimes you really do need to do calculations by hand. However, most of the time functions like emmeans() are a useful shortcut. You need to control the argument type (what scale do you want the prediction on? “link” means the scale of the linear model; “response” means the probability scale).
emmeans(vole_s_glm, ~sex, type="response")
## sex prob SE df asymp.LCL asymp.UCL
## Female 0.278 0.0218 Inf 0.237 0.323
## Male 0.160 0.0186 Inf 0.127 0.200
##
## Confidence level used: 0.95
## Intervals are back-transformed from the logit scale
So, a GLM predicts probabilities, values between 0 and 1. But we observe only 0s and 1s. Does the GLM know about that? Yes it does! The GLM relates probabilities to data in a very obvious way: for a predicted probability \(p\), the GLM thinks you should observe a proportion \(p\) of 1s and a proportion \(1-p\) of 0s. If this seems trivial, it is because it is. GLMs for binary data are really easy.
Other GLMs use more complex processes, so let us look more formally at how a binary GLM sees the world.
The distribution turning a probability \(p\) into 0s and 1s is the Bernoulli distribution, a special case of the binomial distribution. We can draw random samples from the Bernoulli distribution with rbinom(n=, size=1, prob=).
(bernoulli_sample <- rbinom(n = 1000, size = 1, prob = 0.3))
## [1] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 0 0 0 0 0 0 1 1 1 0 0 0 1
## [38] 0 0 0 0 0 0 0 1 0 0 0 0 1 1 0 1 0 1 0 1 1 0 0 0 1 1 0 0 1 0 1 0 1 0 0 1 0
## [75] 0 0 0 0 1 1 0 0 0 0 1 1 1 0 0 0 0 1 0 1 0 0 0 0 0 0 0 0 0 1 0 0 0 1 1 1 0
## [112] 1 0 1 1 1 1 0 0 0 0 0 1 1 0 1 1 0 1 1 0 0 0 0 1 0 1 1 1 0 0 0 0 0 1 0 0 1
## [149] 0 0 0 0 0 0 0 1 0 0 0 0 0 0 1 0 0 0 0 0 0 0 1 0 0 1 0 0 0 0 1 0 1 1 0 0 1
## [186] 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0
## [223] 1 0 1 0 0 0 0 1 0 1 1 0 0 0 1 1 1 0 0 0 0 0 0 0 0 1 1 0 0 1 0 0 0 0 1 0 0
## [260] 0 0 1 0 1 0 0 1 0 0 0 0 0 1 0 0 0 0 0 1 0 0 1 0 0 0 0 0 0 1 0 0 0 1 0 0 0
## [297] 0 1 1 0 0 0 0 0 1 0 1 0 0 1 0 1 0 1 1 1 0 1 1 0 0 0 0 0 1 0 0 1 0 0 1 0 1
## [334] 0 0 1 0 0 0 0 1 0 1 0 1 0 1 0 1 0 1 0 0 0 0 0 0 0 1 1 0 1 0 1 1 0 1 0 0 0
## [371] 0 1 0 0 0 0 0 0 0 0 0 1 1 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 1 1 0 0 1 0 0 1 1
## [408] 1 0 0 1 1 1 1 0 0 0 0 1 0 0 0 0 0 0 0 0 1 1 1 1 0 1 0 0 0 0 1 0 0 0 0 0 1
## [445] 0 1 0 1 0 0 1 0 1 1 0 0 1 0 1 0 0 0 0 1 1 0 0 0 0 1 1 0 1 1 0 1 0 1 1 0 0
## [482] 0 0 0 0 1 0 0 0 1 1 0 0 1 0 0 0 0 0 1 0 1 0 0 1 0 1 0 1 1 0 0 0 0 0 0 0 1
## [519] 0 0 0 1 0 1 0 1 0 0 0 0 0 1 1 0 0 0 0 0 1 0 1 1 1 1 1 0 0 0 1 0 0 1 0 0 0
## [556] 1 0 0 1 1 0 1 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 1 1 0 1 0 1 0 0
## [593] 1 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 1 0 0 0 0 0 0 1 1 0 0 1 0 1 0 0 1 0 0 0 0
## [630] 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 1 0 1 1 0 0 0 0 0 0 0 1 1 0 0 0 0
## [667] 0 1 1 0 1 1 0 0 1 1 1 0 0 0 0 0 1 0 0 0 0 0 0 0 0 1 0 0 1 1 0 1 0 0 0 0 1
## [704] 0 1 1 1 1 0 1 0 0 0 0 0 1 0 0 0 0 0 1 0 0 0 1 0 1 1 0 0 1 0 0 0 0 0 0 0 1
## [741] 0 0 0 0 1 0 0 0 1 0 1 0 0 1 0 1 0 1 0 0 0 1 0 0 1 0 0 0 0 1 0 0 0 0 0 0 0
## [778] 0 0 0 0 0 0 1 0 0 0 0 0 0 1 0 1 0 0 1 0 0 0 0 1 1 0 0 0 0 1 0 0 1 0 0 0 0
## [815] 1 1 1 0 1 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 1 0 0 1 0 0 1 0 1 0 0 0 1 0 0 0
## [852] 1 1 0 0 0 0 1 0 0 1 1 1 0 0 0 0 1 0 1 0 0 0 1 0 0 1 1 0 0 0 0 1 0 0 0 0 0
## [889] 1 0 0 1 0 0 0 0 0 0 0 1 0 1 1 0 1 1 1 0 0 0 0 0 0 1 0 0 1 0 0 0 0 0 0 1 0
## [926] 1 1 1 0 0 0 0 1 0 0 1 1 0 0 1 0 0 0 1 0 0 0 1 1 1 0 1 1 0 1 0 0 1 0 1 0 0
## [963] 0 0 1 0 0 0 0 0 0 0 0 1 1 0 0 1 0 0 1 1 0 0 0 1 1 0 1 0 1 1 1 1 0 0 0 1 0
## [1000] 0
mean(bernoulli_sample)
## [1] 0.297
All GLMs use distributions that have some kind of relationship between their mean and their variance. For the Bernoulli, the relationship is \(variance = mean \times (1-mean)\), and the mean is expected to be equal to the probability.
var(bernoulli_sample)
## [1] 0.209
mean(bernoulli_sample)*(1-mean(bernoulli_sample))
## [1] 0.208791
That means that binary data are most variable for a probability of 0.5, and least variable for probabilities close to 0 or close to 1.
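We can check this mean–variance relationship analytically before simulating it:

```r
p <- c(0.05, 0.25, 0.5, 0.75, 0.95)
p * (1 - p)  # Bernoulli variance: largest at p = 0.5, shrinking towards 0 and 1
```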
nb_rows <- 1000
variability_bernoulli <- tibble(lm_value= seq(-5,5, length.out = nb_rows),
probability = plogis(lm_value),
observation = rbinom(n = nb_rows, size = 1, prob = probability))
ggplot(variability_bernoulli, aes(x=lm_value, y=probability, col=probability)) +
geom_line()+
geom_point(inherit.aes = FALSE, aes(x=lm_value, y=observation))
Let’s simulate survival for each sex based on the predicted survival probabilities:
pred_survival <- summary(emmeans(vole_s_glm, ~sex, type = "response"))
simulated_survival <- pred_survival %>%
group_by(sex) %>%
select(prob) %>%
summarise(survival = rbinom(n = 100, size = 1, prob = prob), prob=prob)
## Adding missing grouping variables: `sex`
## `summarise()` has grouped output by 'sex'. You can override using the `.groups` argument.
simulated_survival %>% group_by(sex) %>%
summarise(simulated_mean = mean(survival), expected_mean=mean(prob),
simulated_var = var(survival), expected_var=mean(prob)*(1-mean(prob)))
## # A tibble: 2 × 5
## sex simulated_mean expected_mean simulated_var expected_var
## <fct> <dbl> <dbl> <dbl> <dbl>
## 1 Female 0.31 0.278 0.216 0.201
## 2 Male 0.14 0.160 0.122 0.134
From the previous examples we saw what a GLM is: a linear model fitted on a latent link scale, an inverse link function mapping that scale to probabilities, and a distribution (here, Bernoulli) turning probabilities into observations.

(Figure: the three scales of a GLM)
\[ y_i = \alpha + \beta x_i \\ p_i = \mathrm{logit}^{-1}(y_i) \\ \mathrm{obs}_i \sim Bernoulli(p_i) \]
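The three lines of this equation map directly onto three lines of R (a sketch with arbitrary illustrative parameter values):

```r
alpha <- 1; beta <- 2             # arbitrary illustrative values
x <- c(-1, 0, 1)
y <- alpha + beta * x             # linear (link) scale: any real number
p <- plogis(y)                    # probability scale: squeezed into (0, 1)
obs <- rbinom(n = length(p), size = 1, prob = p)  # data scale: 0s and 1s
```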
Let’s consider the relationship between mass and survival.
ggplot(survdat_early, aes(x = mass, y=survival)) +
geom_point(alpha=0.2)
## Warning: Removed 13 rows containing missing values (geom_point).
Difficult to see much on this plot. Better to add some jitter/beeswarm and a glm fit:
ggplot(survdat_early, aes(x = mass, y=survival)) +
geom_beeswarm(groupOnX = FALSE) +
geom_smooth(method = "glm", method.args=list(family="binomial"))## `geom_smooth()` using formula 'y ~ x'
## Warning: Removed 13 rows containing non-finite values (stat_smooth).
## Warning: Removed 13 rows containing missing values (position_beeswarm).
vole_glm <- glm(survival ~ mass, data = survdat_early) # NB: no family specified, so glm() defaults to gaussian
summary(vole_glm)
##
## Call:
## glm(formula = survival ~ mass, data = survdat_early)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -0.2954 -0.2466 -0.2015 -0.1490 0.8736
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.336739 0.045189 7.452 2.41e-13 ***
## mass -0.003755 0.001403 -2.677 0.00759 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for gaussian family taken to be 0.1718025)
##
## Null deviance: 137.64 on 795 degrees of freedom
## Residual deviance: 136.41 on 794 degrees of freedom
## (13 observations deleted due to missingness)
## AIC: 860.87
##
## Number of Fisher Scoring iterations: 2
So it looks like higher body mass corresponds to lower survival across the population.
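Careful, though: the call above omitted family = "binomial" (note the Gaussian dispersion parameter in the summary), so we actually fitted an ordinary linear model. The intended binomial fit would look as follows; since voles_early.csv is not bundled here, this sketch runs on simulated stand-in data with hypothetical parameter values:

```r
# Stand-in data mimicking the structure of survdat_early (hypothetical values):
set.seed(1)
survdat_demo <- data.frame(mass = rnorm(100, mean = 30, sd = 8))
survdat_demo$survival <- rbinom(100, size = 1,
                                prob = plogis(0.3 - 0.04 * survdat_demo$mass))
# The intended model: a binomial GLM of survival on mass
vole_glm_binomial <- glm(survival ~ mass, data = survdat_demo,
                         family = "binomial")
summary(vole_glm_binomial)
```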